{
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "## HTTP 爬虫笔记\n",
        "\n",
        "> $Requests:\\ HTTP\\ for\\ Humans^{TM}$\n",
        "\n",
        "- Requests, 强大、简洁而优雅\n",
        "- 作者是**Kenneth Reitz**大牛，其才华和形象都令人印象深刻\n",
        "- 个人一般用来做区别于浏览器模拟的 HTTP 请求爬虫，速度当然就要比 selenium 快多了\n",
        "- 如果想更快就要用到 aiohttp, gevent 等异步 requests 版本了"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "import requests"
      ],
      "outputs": [],
      "execution_count": null,
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": [
        "### 伪装请求头\n",
        "\n",
        "首先要尽量伪装请求头（Request Headers），避免被 ban\n",
        "\n**这里还可以使用 faker 库生成指定类型的随机 User-Agent**"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "# 常用伪装请求头，根据人工正规登录后的信息修改后再使用。\n",
        "headers = {\n",
        "    'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',\n",
        "    'Accept-Encoding':'utf-8',\n",
        "    'Accept-Language':'zh-CN,zh;q=0.9,en;q=0.8',\n",
        "    'Cache-Control':'max-age=0',\n",
        "    'Connection':'keep-alive',\n",
        "#    'Cookie':'',\n",
        "    'Host':'',\n",
        "    'Referer':'',\n",
        "    'Upgrade-Insecure-Requests':'1',\n",
        "    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3573.0 Safari/537.36',\n",
        "}\n",
        "\ncookies = {'cookie_name':'cookie_value'}  # cookie可以写在 headers 里面，也可以单独另外给出"
      ],
      "outputs": [],
      "execution_count": null,
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": [
        "### 发起请求\n",
        "\n",
        "使用伪装请求头和登录 cookies 打开目标网站，绕过登录步骤。缺点是 cookie 需要每次手工登录获取，而且有效期不一定。\n",
        "\n还可以通过 params 参数传入一个 dict，用于拼接 URL 的查询参数"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "payload = {'a':'1'}\n",
        "r = requests.get('target_url', cookies=cookies, headers=headers, params = payload )\n",
        "print(r.url) # 结果为 target_url?a=1"
      ],
      "outputs": [],
      "execution_count": null,
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": [
        "### 页面分析\n",
        "\n",
        "当然可以使用`beautiful soap`, `etree`等包对 html 树结构进行分析。如果目标位置上下文的文本唯一显著，那也可以直接用如下方法\n",
        "\n假设想要获取的信息在文本\"anchor\"后的10位数字"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "anc = r.text.index('anchor') # 将索引原点钉在某个方便唯一定位的锚文本上\n",
        "target = r.text[anc+6:anc+16] # anchor第一个字母a位置为0"
      ],
      "outputs": [],
      "execution_count": null,
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": [
        "### 下载文件\n",
        "\n最直接的可以使用 Python 自带的 open 方法来读写文件"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "with open(file,'w',encoding='utf-8') as f: # 第二个参数换成 wb 即可二进制写入，但就不能指定编码了\n",
        "    f.write(yourcontent) # with open 结尾不需要 f.close()"
      ],
      "outputs": [],
      "execution_count": null,
      "metadata": {}
    },
    {
      "cell_type": "markdown",
      "source": [
        "\n",
        "优点是可以控制写入文件的全过程（特别是编码），必须指定文件名和文件类型。\n",
        "\n但有时候并不想这么做，只是想直接把下载链接给出的文件原样下载下来。那么就可以使用 urlretrieve"
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "import urllib.request # 另一个request, urlretrieve位于此模块内\n",
        "\nurllib.request.urlretrieve(download_url, file_path)"
      ],
      "outputs": [],
      "execution_count": null,
      "metadata": {}
    }
  ],
  "metadata": {
    "kernelspec": {
      "name": "python3",
      "language": "python",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python",
      "version": "3.6.5",
      "mimetype": "text/x-python",
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "pygments_lexer": "ipython3",
      "nbconvert_exporter": "python",
      "file_extension": ".py"
    },
    "kernel_info": {
      "name": "python3"
    },
    "nteract": {
      "version": "0.12.3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}